Evaluating methods for Lasso selective inference in biomedical research: a comparative simulation study

Background: Variable selection for regression models plays a key role in the analysis of biomedical data. However, inference after selection is not covered by classical frequentist statistical theory, which assumes a fixed set of covariates in the model. This leads to over-optimistic inference and replicability issues.

Methods: We compared proposals for selective inference targeting the submodel parameters of the Lasso and its extension, the adaptive Lasso: sample splitting, selective inference conditional on the Lasso selection (SI), and universally valid post-selection inference (PoSI). We studied the properties of the proposed selective confidence intervals, available via R software packages, using a neutral simulation study inspired by real data commonly seen in biomedical studies. Furthermore, we present an exemplary application of these methods to a publicly available dataset to discuss their practical usability.

Results: Frequentist properties of selective confidence intervals by the SI method were generally acceptable, but the claimed selective coverage levels were not attained in all scenarios, in particular with the adaptive Lasso. The actual coverage of the extremely conservative PoSI method exceeded the nominal levels, and this method also required the greatest computational effort. Sample splitting achieved acceptable actual selective coverage levels, but the method is inefficient and leads to less accurate point estimates. The choice of inference method had a large impact on the resulting interval estimates, so the user must be acutely aware of the goal of inference in order to interpret and communicate the results.

Conclusions: Despite violating nominal coverage levels in some scenarios, selective inference conditional on the Lasso selection is our recommended approach for most cases. If simplicity is strongly favoured over efficiency, then sample splitting is an alternative. If only few predictors undergo variable selection (i.e. up to 5) or the avoidance of false positive claims of significance is a concern, then the conservative approach of PoSI may be useful. For the adaptive Lasso, SI should be avoided and only PoSI and sample splitting are recommended. In summary, we find selective inference useful to assess the uncertainties in the importance of individual selected predictors for future applications.

Supplementary Information: The online version contains supplementary material available at 10.1186/s12874-022-01681-y.


Supplementary Figure S1: Graphical illustration of two views of post-selection inference.
Berk and colleagues (1) introduced two alternative views on post-selection inference. In the full model view, the population parameters of interest are those of the full set of candidate predictors, possibly including interactions or non-linear terms. The "full" model comprising all candidate predictors is the object of interest for future research and is assumed to be a description of the data generating mechanism. Variable selection therefore merely amounts to forcing some of the coefficients in the estimated model to zero, but a corresponding population parameter still exists and serves as the target of post-selection inference. In the alternative submodel view, interest lies in the parameters of the selected variables only. The full model has no special meaning, as a model's primary purpose is to provide a succinct description of the association between the outcome and the independent variables, not necessarily capturing the data generating mechanism. The hypotheses to be assessed, and the population quantities, therefore depend on the selected submodel.
A submodel of the full model can be interpreted as a linear approximation to the full model, with no requirement that its coefficients equal those of the full model. In fact, the full model and submodel targets differ unless the predictors are uncorrelated, and interval estimates (indicated by black lines in the figure) differ, and are likely wider, between full model and submodel inference due to the need to account for variable selection.
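To make the difference between the two targets concrete, consider the following small R sketch (our own illustration, not part of the original figure): with two correlated predictors, the submodel coefficient of x1 absorbs part of the effect of the omitted x2 and therefore differs from the full model coefficient.

```r
set.seed(1)
n  <- 1e5
x1 <- rnorm(n)
x2 <- 0.7 * x1 + sqrt(1 - 0.7^2) * rnorm(n)  # cor(x1, x2) is about 0.7
y  <- x1 + x2 + rnorm(n)                     # full model: both effects equal 1
coef(lm(y ~ x1 + x2))  # full model view: coefficients close to (1, 1)
coef(lm(y ~ x1))       # submodel view: x1 target close to 1 + 0.7 = 1.7
```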

Simulation profile
This brief overview provides a high-level summary and pointers to the essential definitions of the simulation study in this manuscript.

Design
Aim: To evaluate recent proposals for selective inference in the context of Lasso regression regarding their frequentist properties in practical usage scenarios.
Data generation: Aimed at studying different correlation patterns and effect distributions in sparse settings. The data generation procedure is outlined in the figure below.
Setups: Two setups with different pre-specifications (notation referencing the figure above).
Primary estimands: Selective coverage, power and type 1 error (see Table 3).
Further estimands: True model and variable selection frequencies, width of confidence intervals, prediction accuracy, inference stability.
Methods: Lasso and adaptive Lasso for variable selection, tuned either by cross-validation (CV) or using a fixed estimated penalization strength following Negahban et al. (2). Sample splitting and the approaches by Berk et al. (1) (PoSI) and Lee et al. (3) (SI) for selective inference (see Table 2).
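As a minimal sketch of the two selection methods (not the study's actual code), both can be fitted with glmnet; the ridge-based initial estimator for the adaptive weights and the placeholder data are assumptions for illustration only.

```r
library(glmnet)
set.seed(1)
x <- matrix(rnorm(200 * 10), 200, 10)  # placeholder predictor matrix
y <- x[, 1] + rnorm(200)               # placeholder outcome

# Lasso tuned by cross-validation (CV)
fit_lasso <- cv.glmnet(x, y)
b_lasso   <- coef(fit_lasso, s = "lambda.min")

# Adaptive Lasso: penalty weights from an initial estimator
# (a ridge fit is used here purely for illustration)
b_init  <- coef(cv.glmnet(x, y, alpha = 0), s = "lambda.min")[-1]
w       <- 1 / abs(b_init)                      # adaptive weights
fit_ada <- cv.glmnet(x, y, penalty.factor = w)  # adaptive Lasso fit
b_ada   <- coef(fit_ada, s = "lambda.min")
```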

Results
Primary estimands: Selective coverage and type 1 error were mostly acceptable for all inference approaches for the Lasso, with undercoverage for the adaptive Lasso tuned by CV when using the Lee et al. method; selective power was generally lower for the Berk et al. approach, but with very low type 1 error (Figure 1).

Simulation study details
Code to generate the datasets used in this study is available from https://github.com/matherealize/LassoSI.

Derivation of estimators
We present the derivation of our simulation-based approximation of the selective estimands, using selective coverage after a certain variable selection procedure of interest as an example. The normalizing quantity of the estimator simply counts the total number of selection events for the variable selection procedure over all iterations of the simulation scenario.
In practice, the computations simplify drastically, resulting in the estimators shown in Table 3.
For example, the approximation of overall selective coverage can be explicitly obtained from the equations above as

$$\widehat{\mathrm{coverage}} = \frac{\sum_{i=1}^{n_{\mathrm{sim}}} \sum_{j \in \hat{M}_i} \mathbf{1}\{\beta_{j \cdot \hat{M}_i} \in CI_{ij}\}}{\sum_{i=1}^{n_{\mathrm{sim}}} |\hat{M}_i|},$$

where $n_{\mathrm{sim}}$ denotes the number of simulation iterations, $\hat{M}_i$ the model selected in iteration $i$, $CI_{ij}$ the confidence interval for variable $j$ in iteration $i$, and $\beta_{j \cdot \hat{M}_i}$ the corresponding submodel target parameter. This outlines the significant simplifications and yields the result as given in Table 3 in the main manuscript. Note that the right-hand side can then be easily interpreted as "evaluate whether a CI covers its target parameter, whenever a CI is available".
In the case of a conditional quantity, i.e. conditional on a specific variable $j$ of interest being selected, the analogous interpretation would be "evaluate whether a CI for variable $j$ covers its target parameter, whenever a CI for $j$ is available (i.e. $j$ was selected)".
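In code, these estimators reduce to simple proportions over the available CIs. The following R sketch assumes a hypothetical data frame results with one row per selected variable per iteration; the column names are our own, not from the study's code.

```r
# Hypothetical simulation output: one row per (iteration, selected variable)
results <- data.frame(
  variable = c("x1", "x2", "x1"),
  lower    = c(0.2, -0.5, -0.1),
  upper    = c(1.1,  0.4,  0.9),
  target   = c(0.8,  0.0,  0.8)
)

# Evaluate whether each available CI covers its target parameter
covered <- with(results, lower <= target & target <= upper)
mean(covered)                            # overall selective coverage
tapply(covered, results$variable, mean)  # coverage conditional on selection
```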

Toy simulation setup
The toy setup was kept extremely simple, with four standard normally distributed candidate predictors, in order to provide a focused assessment of the methods in our study. The rationale was that more variables would only add (hard to control) noise and a larger number of essentially equivalent ways to distribute effects, blurring the conclusions and making the results less intelligible. This distilled setup allowed us to study seven different structures for the true correlation matrix, including the "correlated" and "correlated_neg" patterns referenced below.
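As an illustrative sketch, such data can be generated with MASS::mvrnorm; the correlation values below are hypothetical, standing in for one of the seven structures.

```r
library(MASS)

# One hypothetical correlation structure for the four predictors
R <- matrix(c(1.0, 0.5, 0.0, 0.0,
              0.5, 1.0, 0.0, 0.0,
              0.0, 0.0, 1.0, 0.0,
              0.0, 0.0, 0.0, 1.0), nrow = 4)

# Standard normal predictors, so the covariance equals the correlation matrix
X <- mvrnorm(n = 200, mu = rep(0, 4), Sigma = R)
```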

Realistic simulation setup
The realistic setup featured a fixed, complex correlation structure and different kinds of variable distributions based on real clinical data, following the design presented in (8). Standard normal data were first drawn using a pre-specified correlation matrix (see the Correlation structure section below) and then transformed to achieve the desired variable distributions (see the Data transformation section below). Example code demonstrating how the data were simulated using the simdata package (7) is provided in the file Realistic_Setup_Demo.R in the online code repository.

Correlation structure
The correlation matrix used for drawing data from a multivariate normal distribution was modified from the original to feature stronger correlations. The following network depicts the final correlation structure: individual variables are shown as nodes of a graph, and correlations between two variables are indicated as edges labelled with the respective correlation coefficient.
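Such a network can be drawn, for example, with the igraph package; the following sketch is our own illustration with a small hypothetical correlation matrix, not the study's figure code.

```r
library(igraph)

# Hypothetical 3-variable correlation matrix
R <- matrix(c( 1.0,  0.5,  0.0,
               0.5,  1.0, -0.3,
               0.0, -0.3,  1.0), nrow = 3)

A <- R * (abs(R) > 0.1)  # keep only non-negligible correlations as edges
diag(A) <- 0
g <- graph_from_adjacency_matrix(A, mode = "undirected", weighted = TRUE)
plot(g, edge.label = round(E(g)$weight, 2))  # edges labelled with r
```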

Data transformation
Data from the initial multivariate normal distribution were transformed to achieve different variable distributions. The transformations were the same as in the original publication (8).
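As hedged examples of what such marginal transformations can look like (the actual transformations are those of (8) and are not reproduced here), standard normal columns can be mapped to skewed, binary or ordinal variables:

```r
library(MASS)
Z <- mvrnorm(n = 200, mu = rep(0, 3), Sigma = diag(3))  # placeholder draws

x_skewed  <- exp(Z[, 1])                      # log-normal, right-skewed
x_binary  <- as.integer(Z[, 2] > qnorm(0.7))  # binary with ~30% prevalence
x_ordinal <- cut(Z[, 3], breaks = qnorm(c(0, 1/3, 2/3, 1)),
                 labels = FALSE)              # three roughly equal categories
```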

R package details
The algorithm implemented in the selectiveInference package involved a grid search for the bounds of the selective CIs, for which the search space was extended from the default values to the interval [-1000, 1000] with 1000 interval steps. As a side note, some minor bugs were fixed in the package to ensure correct computation of the CIs. These concern logistic regression and have also been reported as issues on the package's GitHub repository:
- Confidence intervals for logistic regression may have the wrong sign. After studying the code, we believe this happens when the sign of the coefficient is negative, in which case the interval needs to be flipped.
- The significance level passed for logistic regression is ignored. A simple fix for this was applied in the software used for this simulation study.
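For reference, a minimal sketch of this inference step following the selectiveInference documentation is shown below; the data and the penalty value are placeholders, and the number of grid points is controlled internally by the package.

```r
library(glmnet)
library(selectiveInference)
set.seed(1)
x <- matrix(rnorm(100 * 10), 100, 10)  # placeholder predictor matrix
y <- x[, 1] + rnorm(100)               # placeholder outcome

n    <- nrow(x)
gfit <- glmnet(x, y, standardize = FALSE)
lam  <- 0.1 * n                        # fixed penalty value (hypothetical)
b    <- coef(gfit, x = x, y = y, s = lam / n, exact = TRUE)[-1]
out  <- fixedLassoInf(x, y, b, lam, alpha = 0.05,
                      gridrange = c(-1000, 1000))  # extended search space
out$ci                                 # selective confidence intervals
```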
The algorithm for the computation of the PoSI constant as implemented in the PoSI package involved numeric simulations. Due to the computational burden, we reduced the number of these internal simulations to 500 for the realistic setup only.

By comparing different panels of Figure 1 of the main manuscript, we demonstrate how the different methods behave in different scenarios; we discuss here a subset of all scenarios covering two design factors of the simulation study (see Table 1). The Split method yields similar results for all scenarios and for both Lasso and adaptive Lasso, demonstrating the robustness of the method. The PoSI method is conservative, as demonstrated by, e.g., the comparison of columns "v1" and "v1234": due to very wide CIs, the method has low selective power, which is why the selective coverage is too high in column "v1234" but closer to the nominal level in column "v1". For the SI method it is also instructive to compare these columns. In contrast to PoSI, SI generally struggles with a higher selective type 1 error, and thus the selective coverage is closer to nominal in column "v1234" than in column "v1". This is particularly obvious when there are strong correlations between the variables, e.g. in rows "correlated" and "correlated_neg". Lastly, the Neg method to tune the Lasso yields extremely sparse models, which translates to highly variable results across the different scenarios, in particular in scenarios with negative correlations (row "correlated_neg").

Generally, the behaviour of all methods is further outlined in Table 4 in the main manuscript. Results indicate that the probability of selecting the true model increases with higher R² for the ALasso approaches, but remains largely constant for the Lasso approaches. Similarly, the probability of selecting false positives decreases for ALasso, but increases slightly for the Lasso.
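Returning to the PoSI computation described at the beginning of this subsection, a minimal sketch assuming the PoSI package's documented interface is given below; X is a placeholder predictor matrix, and Nsim sets the number of internal simulations mentioned above.

```r
library(PoSI)
set.seed(1)
X <- matrix(rnorm(100 * 5), 100, 5)  # placeholder predictor matrix

posi_obj <- PoSI(X, Nsim = 500)  # reduced internal simulations (realistic setup)
summary(posi_obj)                # PoSI constants at the reported confidence levels
```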

Computational aspects
Computing time is an important aspect in practical applications, especially with many candidate predictors. As expected, the computing time of the Lasso-CV-PoSI and ALasso-CV-PoSI approaches grew drastically between the toy and the realistic setup, reflecting the exponential growth of the PoSI computation in the number of candidate predictors, while all other methods showed only minimal increases. The computation of the PoSI constant for a single iteration of the realistic setup with 17 candidate predictors took on average 35 seconds (Intel Core i7 4790 @ 3.6 GHz, R with multithreaded OpenBLAS matrix library). All other methods remained well below 0.5 seconds.